- The host controls the entire flow of the program.
Let us look at what the program flow looks like:
```cpp
#include <iostream>

int main() {
    // host code
    // do some sequential work (executes on the CPU)

    // launch a kernel for the parallel work (executes on the GPU)
    Kernel_0<<<grid_size, block_size>>>(arg1, arg2, ..., argN);
    // note: kernel launches are asynchronous; the CPU does not wait for the
    // device to finish Kernel_0 unless we synchronize explicitly (see the sketch below)

    // do some sequential work
    // do more sequential work (executes on the CPU)

    // launch another kernel
    Kernel_1<<<grid_size1, block_size1>>>(arg1, arg2, ...);

    // do some sequential work, again without waiting unless we synchronize

    return 0;
}
```
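If the host does need a kernel's results before continuing, it has to synchronize explicitly. A minimal sketch, reusing the placeholder names from the snippet above:

```cpp
// launch the kernel (asynchronous from the host's point of view)
Kernel_0<<<grid_size, block_size>>>(arg1, arg2);

// block the host until all previously launched device work has finished
cudaDeviceSynchronize();

// from here on it is safe to use the results of Kernel_0 on the host
```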
Kernel launch syntax:
```cpp
kernel_name<<<grid_size, block_size>>>(arg1, arg2, ...);
```
Note that we must specify the grid size (which defines the organisation of blocks within the grid) and the block size (which defines the organisation of threads within a block).
The grid and block dimensions are given by variables of the CUDA-defined data type `dim3`.
- `dim3` is a CUDA data structure that stores an integer value for each of the three dimensions.
Example:
```cpp
dim3 grid_size(x, y, z);  // x blocks in the x direction, y blocks in the y direction, z blocks in the z direction
dim3 block_size(x, y, z); // x threads in the x direction, y threads in the y direction, z threads in the z direction
```
The default values are (1,1,1), i.e. 1 block in the x, 1 block in the y, and 1 block in the z direction.
Similarly, block_size = (1,1,1) represents 1 thread in the x, 1 thread in the y, and 1 thread in the z direction.
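As a concrete illustration, any dimension that is not specified explicitly defaults to 1:

```cpp
dim3 grid_size(8);       // equivalent to (8, 1, 1): 8 blocks in x, 1 in y, 1 in z
dim3 block_size(16, 16); // equivalent to (16, 16, 1): 16 threads in x, 16 in y, 1 in z
```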
Once the grid and block dimensions are specified, the kernel can be launched as follows:
```cpp
int main() {
    dim3 grid_size(4, 3);  // 12 blocks in total
    dim3 block_size(5, 5); // each of the 12 blocks has 5 threads in the x direction and 5 in the y direction

    kernel<<<grid_size, block_size>>>(arg1, arg2); // this is how we launch the kernel
}
```
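Note that for simple one-dimensional launches the values inside `<<< >>>` can also be plain integers; they are implicitly converted to `dim3`, with the unspecified dimensions set to 1. For example (arg1 and arg2 are placeholders, as above):

```cpp
// 12 blocks of 25 threads each, all in the x direction
kernel<<<12, 25>>>(arg1, arg2);
```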
The host and device have separate memory regions. To operate on any data inside a kernel, we first allocate memory on the device, then copy the data from host to device, then launch the kernel, and finally copy the data back from device to host.
This can be seen as the following skeleton:
```cpp
int main() {
    // some sequential code

    // allocate memory on the device
    cudaMalloc();

    // copy data from host to device
    cudaMemcpy();

    // launch the kernel
    kernel<<<grid_dim, block_dim>>>(arg1, arg2);

    // some sequential work (meanwhile, if there is any)

    // copy data from device to host
    cudaMemcpy();
}
```
Allocating memory in CUDA is analogous to C; recall that in C we use:
- Allocate memory in C: malloc();
- De-allocate memory in C: free();
To allocate memory in CUDA, we use (as shown in the sketch after this list):
- cudaMalloc(location, size);
  - location: the memory location on the device at which to allocate (it is an address in GPU memory; we pass the address of the pointer that will hold it).
  - size: the number of bytes to allocate.
To de-allocate memory in CUDA, we use:
- cudaFree(location);
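A short sketch of allocation and de-allocation together (d_a and N are illustrative names, not taken from the text above):

```cpp
int N = 100;        // number of integers to store on the device (illustrative)
int *d_a = nullptr; // will hold an address in GPU memory

// allocate N * sizeof(int) bytes on the device; we pass the address of the
// pointer so that cudaMalloc can write the device address into d_a
cudaMalloc((void**)&d_a, N * sizeof(int));

// ... copy data, launch kernels, etc. ...

// de-allocate the device memory
cudaFree(d_a);
```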
To transfer data between host and device, we use:
`cudaMemcpy(destination, source, numBytes, direction);`
Here the arguments are:
- destination: pointer to the destination address that the data is copied to.
- source: pointer to the data that is being copied.
- numBytes: size of the data being copied, in bytes (numBytes = N x sizeof(type)).
- direction: it takes one of the two:
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
```cpp
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    // First, declare the pointer variables that hold the addresses of the data
    // on the host and on the device. To differentiate host variables from
    // device variables, we use the following h_ / d_ naming convention.
    int *h_a, *d_a;
    int number_of_integers = 100;

    // allocate and fill the array on the host
    h_a = (int*)malloc(number_of_integers * sizeof(int));
    for (int i = 0; i < number_of_integers; i++) {
        h_a[i] = i; // we can use pointers as arrays; for more, see the links below
    }

    // d_a is the pointer variable that will hold the address of the data on the device
    // allocate memory on the device
    cudaMalloc((void**)&d_a, number_of_integers * sizeof(int));

    // copy the data from host to device
    cudaMemcpy(d_a, h_a, number_of_integers * sizeof(int), cudaMemcpyHostToDevice);

    // define grid and block dimensions
    dim3 grid_size(1);
    dim3 block_size(number_of_integers);

    // launch the kernel
    kernel<<<grid_size, block_size>>>(d_a, number_of_integers);

    // copy the results back from device to host
    cudaMemcpy(h_a, d_a, number_of_integers * sizeof(int), cudaMemcpyDeviceToHost);

    // de-allocate memory
    cudaFree(d_a);
    free(h_a);

    return 0;
}
```
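The example above launches a kernel that is not defined in the snippet. Purely as an illustration (the kernel body here is an assumption, not something given in the text), a kernel matching that launch could look like:

```cpp
// Hypothetical kernel (not part of the example above): each thread handles
// one element, using its thread index within the block.
__global__ void kernel(int *a, int n) {
    int i = threadIdx.x;   // index of this thread inside its block
    if (i < n) {
        a[i] = a[i] + 1;   // e.g. add 1 to every element
    }
}
```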
Some useful links: